FIRST: Many thanks to An Bui & Sam Csik’s tidyverse tutorial, Dr. Simona Picardi’s tidyverse chapter, & Dr. Alison Horst’s tidyverse aRt which helped immensely in preparing this document.

Data Wrangling with the tidyverse

What is the tidyverse?

Artwork by the incredible Alison Horst

The tidyverse is the bread and butter of data tidying, cleaning, organizing, manipulating, etc. It’s a toolkit designed for data science. All of the packages contained within the tidyverse share an underlying philosophy, grammar, and data structure, making it easy to streamline data manipulation and visualization across multiple, unrelated datasets. Read more about this incredible toolkit here!

The core tidyverse includes the following packages:

  • dplyr for data manipulation;
    • primary functions: arrange(), filter(), group_by(), mutate(), select(), summarize()
  • tidyr for transforming data to a tidy format;
    • primary functions: gather() spread() *readr for reading in rectangular data (e.g., .csv);
    • primary functions: read_csv(), readRDS()
  • ggplot2 for plotting/graphics;
    • there is SO MUCH to cover, here; this is a good place to start (Wickam & Grolemund 2017)
  • stringr for manipulating character strings;
    • primary functions: str_detect(), str_count(), str_subset(), str_extract(), str_replace(), str_match(), str_split()
  • tibble for re-engineered alternatives to data frames;
    • primary functions: as_tibble(), tibble()
  • purr for functional programming;
    • primary functions: map()
  • forcats for working with categorical variables;
    • primary functions: fct_reorder(), fct_infreq(), fct_relevel(), fct_lump()
  • magrittr for sequential modification of a data frafme;
    • primary functions: %>%

These are all distinct packages, and must be installed and loaded separately, but again, share common grammar, syntax, and data structures.

Load the tidyverse

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Data wrangling cheat sheet:

Below are reproducible examples of commonly used tidyverse functions.

Let’s first create some completely hypothetical data about the number of times we have each eaten at Paka over the past four years!

# NOTE: this data is NOT tidy (i.e. in wide format, where each row represents three observations, not one)
paka_data <- tribble(
  ~name,    ~`2017`,   ~`2018`,   ~`2019`, ~`2020`, # R doesnt' love vars named as numbers, so wrap them in backquotes! 
  "Britta",       25,        20,        16,        27,   # or avoid the problem by beginning var names with characters 
  "Lusungu",        20,        15,        11,        12,   # (e.g. "year_2017")
  "King",   18,        17,        10,        8,
  "Philip",     19,        10,        14,        22,
  "Lawrence",     34,        15,        9,        17,
  "Chisomo",     21,        13,        14,        11
  )

It’s a REALLY good habit to ALWAYS explore your data before starting to wrangle it; form this habit, now (you’ll thank yourself down the line!):

str(paka_data) # view data structures of paka_data; i.e. what is the type (e.g., character, numeric, factor) of each of the columns
colnames(paka_data) # view column names of paka_data
head(paka_data) # view first 10 rows of paka_data 

Before we do anything else, we’ll first want to transform ‘paka_data’ into a tidy (long) format. Tidy data follows three general rules: 1) variables are in columns (THINK: YEAR, AGE, INCOME, etc.), 2) observations are in rows (THINK: COUNTY, RESPONDENT), and 3) values are in cells (THINK: COUNT, USD$, CATEGORIES). Having tidy data allows us to use the same tools in similar ways for different datasets… which means we don’t have to reinvent the wheel every time we do analyses, but can translate code/knowledge about toolsets to perform similar tasks on different data. We can’t do this with messy data.

Artwork by the incredible Alison Horst

pivot_longer() transforms data from untidy, wide, to long format (NOTE: this function updates gather(), which is no longer under active development)

tidy_paka <- paka_data %>% 
  pivot_longer(cols = c(`2017`, `2018`, `2019`, `2020`), names_to = "year", values_to = "eat_days")

Conversely, you can transform ‘tidy_paka’ back to wide format.

Most R functions prefer long format, tidy data, hence the “tidy” in "tidy"verse(), and long format data typically eases data processing, but there are cases where wide format data is preferred (e.g., visualizing data in tables for human comprehension):

pivot_wider() transforms data from long to wide format (NOTE: this function updates spread(), which is no longer under active development)

# let's convert our 'tidy_paka' data back to wide format using spread()
back_to_wide <- tidy_paka %>% 
  pivot_wider(names_from = year, values_from = eat_days)

From here on, we’ll be working with our tidy data i.e. tidy_paka to practice some useful wrangling functions. But first, an interlude about wrangling.

Data wrangling refers to the art of getting your data into R in a useful form for visualization and modeling. & it is definitely an art…and a science…and sometimes takes some brute force. It can be a LOT of work; a LOT of HARD work. But tidyverse and our use of tidy datasets have made this work MUCH, MUCH more predictable and user-friendly.

Artwork by the incredible Alison Horst

Subsetting data:

select() selects columns to retain and specifies their order in the data.frame()

names_paka <- tidy_paka %>% 
  select(name, eat_days)

filter() selects observations within columns given some criteria

britta_king <- tidy_paka %>% 
  filter(name == "Britta" | name == "King") # "|" tells R to filter any observations that match "Britta" OR "King"

britta_king_alt <- tidy_paka %>% 
  filter(name %in% c("Britta", "King")) # another way of filtering; this is nice if you want to filter through many unique column attributes

not_britta <- tidy_paka %>% # 
  filter(name != "Britta") # != tells R to filter any observations that DO NOT match "Britta"

pull() pulls out a single variable from a data frame and saves it as a vector

eat_days_vec <- tidy_paka %>% 
  pull(eat_days)

Manipulating/adding variables:

arrange() orders observations as specified (default = alphabetical or ascending)

ordered_names <- tidy_paka %>% 
  arrange(name) # for descending alphabetical order, use "arrange(desc(names))"

ordered_num_eatdays <- tidy_paka %>% 
  arrange(eat_days) # for descending order, use "arrange(-eat_days)"

rename() renames a column

renamed_paka <- tidy_paka %>% 
  rename(total_eatdays = eat_days)

mutate() is SUPER versatile function; it can be used to calculate a new value using existing observations and place this value in a new column, it can be used to add a column based on existing observations, to coerce a variable to a different type, etc. Below are a few examples of its usefulness!

# use mutate() to calculate a new value using existing observations and add this new value to a new column
eatdays_per_month <- tidy_paka %>% 
  mutate(eatdays_per_month = eat_days/12)

# use mutate in conjunction with case_when to add a column based off existing observations
fav_eatdays <- tidy_paka %>% 
  mutate(
    fav_food = case_when(
      name == "Britta" ~ "samosa",
      name == "Lusungu" ~ "salad",
      name == "King" ~ "champa",
      name == "Philip" ~ "burger",
      name == "Lawrence" ~ "samosa",
      name == "Chisomo" ~ "champa"
    )
  )

# use mutate in conjunction with ifelse, where if the observation in the 'name' column matches "Britta", "King", or "Philip", report "yes", they are soda drinkers If not, report "no", they are not
soda_drinkers <- tidy_paka %>% 
  mutate(soda = ifelse(name %in% c("Britta", "King", "Philip"), "yes", "no")) 

# use mutate() to coerce a variable to a different data type
name_as_factor <- tidy_paka %>% 
  mutate(name = as_factor(name)) # you can check that this worked by viewing 'str(name_as_factor)'

Summarizing data:

group_by() groups observations such that data operations are performed at the level of the group; this is SUPER useful if you want to complete analyses by age class, or sex, for instance.

grouped_names <- tidy_paka %>% 
  group_by(name) # notice that nothing appears to change when you view 'grouped_names' Grouped data is sort of a phantom phenom, the data sits grouped under the hood, but doesn't appear as such in any R interface. Not until you preform a function with the grouped data.... See the summarize() function below.

summarize() calculates summary statistics; this is also SUPER useful. Wanna find the mean? Median? Mode? Minimum? Maximum? Standard deviation? summarize() has your back!

paka_summary <- tidy_paka %>% 
  group_by(name) %>% 
  summarize(
    avg_eatdays = mean(eat_days), # feel free to substitute any summary stat function here!!
    max_eatdays = max(eat_days),
    min_eatdays = min(eat_days) # and add as many as you want to calculate!
  )

tally() sums values across groups; use this function to find ‘n’

tallied_paka <- tidy_paka %>% 
  group_by(name) %>% 
  tally(eat_days)

Now let’s practice on some REALish data!

Load the tidyverse and any additional required packages:

library(tidyverse)
g <- read.csv("data/gapminder_data.csv")

Explore:

We should first familiarize ourselves with the data.

dim(g) # view dimensions of the df
head(g) # view first 10 rows of df
tail(g) # view last 10 rows of df
str(g) # view data structure of df
colnames(g) # view the columns of df

Wrangle:

This dataset is pretty big–we’ll want to wrangle it so that it only includes the information that we’re interested in. We will:

  1. filter for the African continent
  2. select relevant columns of data
  3. rename columns
  4. create new columns.

To demonstrate these individual steps, we’ll perform each function separately. Notice that we perform subsequent function calls on the data frame generated from the prior step. At the end, we’ll show you how to combine all steps into a single, succinct code chunk. Creating efficient workflows by combining multiple data wrangling steps is one of the great POWERS of tidyverse!

a. filter for the African continent

africa <- g %>% 
  filter(continent %in% c("Africa"))

b. select the columns we want

Let’s select only the columns we’re interested in.

select_columns <- africa %>% 
  select(1:3,lifeExp, gdpPercap) # you can supply a range of columns, or specify them individually

c. rename columns

To make this even more manageable, we can change column names to something easier (i.e., shorter to type). For example:

rename_columns <- select_columns %>% 
  rename(gdp = gdpPercap)

d. create new columns

We can also create new columns:

  1. based conditionally on other columns; OR,
  2. by preforming some calculation.
# conditional column addition
income_class <- rename_columns %>% 
  mutate(income_class = case_when(
              gdp  < 1000 ~ "low",
              (gdp >= 1000 & gdp <= 6000) ~ "middle",
              gdp > 6000 ~ "high"))

# preform some operation/calculation

Now let’s pull all of these steps together!

We split each wrangling step up into a separate data frame, but you could have linked all these functions together in one chunk using the pipe operator ( %>% ), like this:

africa_simple <- g %>% 
  filter(continent %in% c("Africa")) %>% 
  select(1:3,lifeExp, gdpPercap) %>%  
  rename(gdp = gdpPercap) %>% 
  mutate(income_class = case_when(
              gdp  < 1000 ~ "low",
              (gdp >= 1000 & gdp <= 6000) ~ "middle",
              gdp > 6000 ~ "high"))

# save data
saveRDS(africa_simple, "./out-data/africa-tidy.RDS")